Skip to content

ARROW-9528: [Python] Honor tzinfo when converting from datetime#7816

Closed
kszucs wants to merge 25 commits into
apache:masterfrom
kszucs:tz
Closed

ARROW-9528: [Python] Honor tzinfo when converting from datetime#7816
kszucs wants to merge 25 commits into
apache:masterfrom
kszucs:tz

Conversation

@kszucs

@kszucs kszucs commented Jul 21, 2020

Copy link
Copy Markdown
Member

Follow up of:

TODOs:

  • Store all Timestamp values normalized to UTC
  • Infer timezone from the array values if no explicit type was given
  • Testing (especially pandas object roundtrip)
  • Testing of timezone-naive roundtrips
  • Testing mixed pandas and datetime objects

@kszucs

kszucs commented Jul 21, 2020

Copy link
Copy Markdown
Member Author

@emkornfield I've cherry picked ARROW-9223, ARROW-9528, addressed Antoine's reviews comments and simplified the conversion Timestamp conversion path a bit. I'm going to add timezone support for the Time types as well.

@emkornfield

Copy link
Copy Markdown
Contributor

Thanks @kszucs per discussion I'm ok with this not making it into 1.0 release. I am working on making moving tzinfo_to_string to c++ so we can accurately capture tz instead of going to utc. I also still think a flag is a good idea to revert to buggy behavior. Want me to post here?

@kszucs

kszucs commented Jul 21, 2020

Copy link
Copy Markdown
Member Author

@emkornfield sure, please feel free to push to this PR. I'm going to work on the Time type support then.

@kszucs

kszucs commented Jul 22, 2020

Copy link
Copy Markdown
Member Author

@emkornfield I managed to implement TzinfoToString on the C++ side. Currently adding timezone inference support.

Comment thread python/pyarrow/tests/test_convert_builtin.py Outdated
@kszucs kszucs changed the title [Python] Better timezone support ARROW-9528: [Python] Honor tzinfo when converting from datetime Jul 22, 2020
@github-actions

Copy link
Copy Markdown

Comment thread cpp/src/arrow/compute/kernels/scalar_string.cc Outdated
Comment thread cpp/src/arrow/python/arrow_to_pandas.cc Outdated
Comment thread cpp/src/arrow/python/datetime.h Outdated
Comment thread cpp/src/arrow/python/python_to_arrow.cc Outdated
Comment thread python/pyarrow/tests/test_pandas.py Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this might not be true anymore with StringToTz?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it seems like pa.array(result) should recover struct and we want to verify that?

Comment thread python/pyarrow/tests/test_types.py Outdated

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you for the thorough tests.

Comment thread cpp/src/arrow/python/common.h Outdated
@emkornfield

Copy link
Copy Markdown
Contributor

@kszucs I pushed a few commits to:

  1. Fix typos I found on rereview.
  2. Plumb through turning off the use of TZ informaiton in C++ via options.
  3. Population in python via ARROW_NO_TZ environment variable

I'd appreciate any thoughts you have on testing the backwards compatibility (maybe it is sufficient to use this with spark integration tests?)

@kszucs

kszucs commented Aug 2, 2020

Copy link
Copy Markdown
Member Author

Thanks Micah!

Testing it via the spark integration test SGTM. If we think that ignoring timezones during conversion could be a useful feature we could even expose it properly on the python side.

@emkornfield

Copy link
Copy Markdown
Contributor

I'll let others chime in but I think this PR captures correct behavior, so I'd like to deprecate backwards compatibility after a release or two

@kszucs

kszucs commented Aug 2, 2020

Copy link
Copy Markdown
Member Author

I mean ignore_tomezone could be a useful conversion option for mixed timezone aware/naive input values (should be rare or rather discouraged though).

I agree with the deprecation plan. I'm on vacation next week, but hopefully @pitrou and @jorisvandenbossche will be able to take a look at it.

@emkornfield

Copy link
Copy Markdown
Contributor

@github-actions crossbow submit test-conda-python-3.8-spark-master

@github-actions

github-actions Bot commented Aug 3, 2020

Copy link
Copy Markdown

Revision: e27e86cdc83dede4bb70c49ac5af2fb3a0494c51

Submitted crossbow builds: ursa-labs/crossbow @ actions-455

Task Status
test-conda-python-3.8-spark-master Github Actions

@emkornfield

Copy link
Copy Markdown
Contributor

@github-actions crossbow submit test-conda-python-3.8-spark-master

@github-actions

github-actions Bot commented Aug 3, 2020

Copy link
Copy Markdown

Revision: b4e3b9d4dcf3da6c7fa3e652d5b77fc4e7abad13

Submitted crossbow builds: ursa-labs/crossbow @ actions-458

Task Status
test-conda-python-3.8-spark-master Github Actions

@emkornfield

Copy link
Copy Markdown
Contributor

@github-actions crossbow submit test-conda-python-3.7-spark-branch-3.0

@github-actions

github-actions Bot commented Aug 3, 2020

Copy link
Copy Markdown

Revision: b4e3b9d4dcf3da6c7fa3e652d5b77fc4e7abad13

Submitted crossbow builds: ursa-labs/crossbow @ actions-462

Task Status
test-conda-python-3.7-spark-branch-3.0 Github Actions

@wesm wesm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few scattered comments. I need to pull this down and kick the tires a bit to make sure that the new behavior matches my expectations

Comment thread ci/scripts/integration_spark.sh Outdated
Comment thread cpp/src/arrow/compute/kernels/scalar_string.cc Outdated
Comment thread cpp/src/arrow/python/arrow_to_pandas.cc Outdated
Comment thread cpp/src/arrow/python/arrow_to_pandas.cc Outdated
Comment thread cpp/src/arrow/python/common.h Outdated
Comment thread cpp/src/arrow/python/datetime.h Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do these need to be exported?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, the other utility functions are exported as well.

Comment thread cpp/src/arrow/python/python_to_arrow.cc Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit ugly, could we pass an options struct instead?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My thought here was this parameter will only be need until the next release, so it would be better to add it here and then remove it without the need to introduce options. We can introduce options though if you feel strongly.

Comment thread cpp/src/arrow/python/python_to_arrow.cc Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

be consistent about const?

@emkornfield emkornfield Aug 15, 2020

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, probably just late, but not sure what you mean here.

Comment thread python/pyarrow/tests/test_pandas.py Outdated

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed it seems like pa.array(result) should recover struct and we want to verify that?

- Ports string_to_timezone to C++
- Causes nested timestamp columns within structs that have timezones to go through
  object conversion
- Copy timezone on to_object path.

Open to other suggestions on how to solve this.  It still seems like this is inconsistent with nested timestamps in lists which isn't great.

Closes apache#7604 from emkornfield/datetime

Lead-authored-by: Micah Kornfield <emkornfield@gmail.com>
Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
Signed-off-by: Wes McKinney <wesm@apache.org>

@pitrou pitrou left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that @wesm already commented extensively on this, I've only looked at a small part of this PR.

Comment thread cpp/src/arrow/python/arrow_to_pandas.cc Outdated
Comment thread cpp/src/arrow/python/arrow_to_pandas.cc Outdated
Comment thread cpp/src/arrow/python/datetime.cc Outdated
Comment thread cpp/src/arrow/python/datetime.cc Outdated
Comment thread cpp/src/arrow/python/datetime.cc Outdated
Comment thread cpp/src/arrow/python/datetime.cc Outdated
Comment thread python/pyarrow/tests/test_types.py Outdated
Comment thread python/pyarrow/tests/test_types.py Outdated
Comment thread cpp/src/arrow/python/arrow_to_pandas.cc Outdated
@emkornfield

Copy link
Copy Markdown
Contributor

@wesm @kszucs thoughts on any remaining items on this?

@wesm

wesm commented Aug 15, 2020

Copy link
Copy Markdown
Member

Sorry for the delay, didn't have a lot of time for development this week. I am going to pull this down and kick the tires a bit this weekend and will merge later today or tomorrow

@wesm wesm left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 from me. Merging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants